Deep Visual Attributes vs. Hand-Crafted Audio Features on Multidomain Speech Emotion Recognition

Authors

Michalis Papakostas

Evaggelos Spyrou

Theodoros Giannakopoulos

Giorgos Siantikos

Dimitris Sgouropoulos

Phivos Mylonas

Fillia Makedon

Abstract

Emotion recognition from speech may play a crucial role in many applications related to human–computer interaction or to understanding the affective state of users in certain tasks, where other modalities such as video or physiological parameters are unavailable. In general, a human’s emotions may be recognized through several modalities, such as facial expressions, speech, and physiological parameters (e.g., electroencephalograms, electrocardiograms). However, measuring these modalities may be difficult, obtrusive, or require expensive hardware. In that context, speech may be the best alternative modality in many practical applications. In this work we present an approach that uses a Convolutional Neural Network (CNN) functioning as a visual feature extractor and trained using raw speech information. In contrast to traditional machine learning approaches, CNNs are responsible for identifying the important features of the input, thus making hand-crafted feature engineering optional in many tasks. In this paper no features are required other than the spectrogram representations; hand-crafted features were extracted only to validate our method. Moreover, the approach does not require any linguistic model and is not specific to any particular language. We evaluate the proposed approach on cross-language datasets and demonstrate that it provides superior results compared with traditional approaches that use hand-crafted features.
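As an illustration of the spectrogram representation the abstract refers to, the following is a minimal sketch of how a log-magnitude spectrogram could be computed from a raw waveform before being passed to a CNN as an image. The function name `log_spectrogram`, the 16 kHz sampling rate, the frame length, the hop size, and the Hann window are all assumptions made for this example; the abstract does not state the settings the authors used.

```python
import numpy as np


def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Return a log-magnitude spectrogram of shape (n_freq_bins, n_frames).

    frame_len and hop are illustrative defaults (25 ms and 10 ms at 16 kHz),
    not values taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        # Zero-pad short signals so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping frames, one frame per row.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * window
    # Magnitude of the one-sided FFT of every frame.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Log compression; eps avoids log(0). Transpose to (frequency, time).
    return np.log(magnitude + eps).T


# Toy input: one second of a 440 Hz tone at an assumed 16 kHz sampling rate.
sr = 16000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)
```

The resulting two-dimensional array can be treated as a single-channel image, which is what allows a CNN to act as a visual feature extractor on speech.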

Similar resources

Speech Emotion Recognition Using Scalogram Based Deep Structure

Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on the extraction of features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking styles and environment. Here, an SER method has been proposed based on a concat...

Music Emotion Recognition via End-to-End Multimodal Neural Networks

Music emotion recognition (MER) is a key issue in user context-aware recommendation. Many existing methods require hand-crafted features on audio and lyrics. Here we propose a new end-to-end method for recognizing emotions of tracks from their acoustic signals and lyrics via multimodal deep neural networks. We evaluate our method on about 7,000 K-pop tracks labeled as positive or negative emotion....

A Weighted Discrete KNN Method for Mandarin Speech and Emotion Recognition

The speech signal is a rich source of information that conveys more than spoken words, and this information can be divided into two main groups: linguistic and nonlinguistic. The linguistic aspects of speech include the properties of the speech signal and word sequence and deal with what is being said. The nonlinguistic properties of speech have more to do with talker attributes such as age, gender, dialect, and emot...

Learning Multi-level Deep Representations for Image Emotion Classification

In this paper, we propose a new deep network that learns multi-level deep representations for image emotion classification (MldrNet). Image emotion can be recognized through image semantics, image aesthetics and low-level visual features from both global and local views. Existing image emotion classification works using hand-crafted features or deep features mainly focus on either low-level vis...

Open-Domain Audio-Visual Speech Recognition: A Deep Learning Approach

Automatic speech recognition (ASR) on video data naturally has access to two modalities: audio and video. In previous work, audio-visual ASR, which leverages visual features to help ASR, has been explored on restricted domains of videos. This paper aims to extend this idea to open-domain videos, for example videos uploaded to YouTube. We achieve this by adopting a unified deep learning approach...

Journal: Computation

Volume: 5

Publication date: 2017
